404 research outputs found

    Unleashing Fine-Grained Parallelism on Embedded Many-Core Accelerators with Lightweight OpenMP Tasking

    In recent years, programmable many-core accelerators (PMCAs) have been introduced in embedded systems to satisfy stringent performance/Watt requirements. This has increased the need for programming models capable of effectively leveraging hundreds to thousands of processors. Task-based parallelism has the potential to provide such capabilities, offering high-level abstractions to outline abundant and irregular parallelism in embedded applications. However, efficiently supporting this programming paradigm on embedded PMCAs is challenging, due to the large time and space overheads it introduces. In this paper we describe a lightweight OpenMP tasking runtime environment (RTE) design for a state-of-the-art embedded PMCA, the Kalray MPPA 256. We provide an exhaustive characterization of the costs of our RTE, considering both synthetic workloads and real programs, and we compare it to several other tasking RTEs. Experimental results confirm that our solution achieves near-ideal parallelization speedups for tasks as small as 5K cycles, and an average speedup of 12× for real benchmarks, which is 60% higher than what we observe with the original Kalray OpenMP implementation.

    Inductor losses estimation in DC-DC converters by means of averaging technique

    An inductor model suitable for power electronic DC-DC converters is presented in this paper. It is developed with the aim of improving the inductor loss estimation achievable by averaged models, which inherently neglect inductor current ripple. In order to account for the ripple's contribution to the overall inductor losses, an appropriate parallel resistance is included in the inductor model, whose value should be chosen in accordance with the DC-DC converter operating conditions. This allows the development of improved averaged models of DC-DC converters, especially in terms of power loss estimation. The effectiveness of the proposed modeling approach has been validated through a simulation study, which refers to the case of a boost DC-DC converter and is performed by means of a circuit simulator designed for rapid modelling of switching power systems (SIMetrix/SIMPLIS).

    A memory-centric approach to enable timing-predictability within embedded many-core accelerators

    There is an increasing interest among real-time systems architects in multi- and many-core accelerated platforms. The main obstacle to the adoption of such devices within industrial settings is the difficulty of tightly estimating the multiple interferences that may arise among the parallel components of the system. This in particular concerns concurrent accesses to shared memory and communication resources. Existing worst-case execution time (WCET) analyses are extremely pessimistic, especially when adopted for systems composed of hundreds to thousands of cores. This significantly limits the potential for the adoption of these platforms in real-time systems. In this paper, we study how the predictable execution model (PREM), a memory-aware approach to enable timing-predictability in real-time systems, can be successfully adopted on multi- and many-core heterogeneous platforms. Using a state-of-the-art multi-core platform as a testbed, we validate that it is possible to obtain an order-of-magnitude improvement in the WCET bounds of parallel applications, if data movements are adequately orchestrated in accordance with PREM. We identify which system parameters most affect the performance opportunities offered by this approach, both on average and in the worst case, taking a first step towards predictable many-core systems.

    A RISC-V-based FPGA Overlay to Simplify Embedded Accelerator Deployment

    Modern cyber-physical systems (CPS) are increasingly adopting heterogeneous systems-on-chip (HeSoCs) as a computing platform to satisfy the demands of their sophisticated workloads. FPGA-based HeSoCs can reach high performance and energy efficiency at the cost of increased design complexity. High-Level Synthesis (HLS) can ease IP design, but automated tools still lack the maturity to efficiently and easily tackle system-level integration of the many hardware and software blocks included in a modern CPS. We present an innovative hardware overlay offering plug-and-play integration of HLS-compiled or handcrafted acceleration IPs thanks to a customizable wrapper attached to the overlay interconnect and providing shared-memory communication to the overlay cores. The latter are based on the open RISC-V ISA and offer simplified software management of the acceleration IP. Deploying the proposed overlay on a Xilinx ZU9EG shows ≈ 20% LUT usage and ≈ 4× speedup compared to program execution on the ARM host core

    Longitudinal and Transverse Wakefields Simulations and Studies in Dielectric-Coated Circular Waveguides

    In recent years, there has been growing interest in, and rapid experimental progress on, the use of electromagnetic fields produced by electron beams passing through dielectric-lined structures and the effects they might have on the drive and witness bunches. Short ultra-relativistic electron bunches can excite very intense wakefields, which provide efficient acceleration through the dielectric wakefield accelerator (DWA) scheme, with a higher gradient than that of a conventional RF LINAC. These beams can also generate high-power, narrow-band THz coherent Cherenkov radiation. These high-gradient fields may create strong instabilities in the beam itself, causing issues in plasma wakefield acceleration (PWFA) experiments, plasma lensing experiments and recent beam diagnostic applications. In this work we report the results of simulations and studies of the wakefields generated by electron beams of different lengths and charges passing on and off axis in dielectric-coated circular waveguides. We also propose a semi-analytical method to calculate these high-gradient fields without resorting to time-consuming simulations.

    The Challenge of Time-Predictability in Modern Many-Core Architectures

    The recent technological advancements and market trends are causing an interesting phenomenon towards the convergence of High-Performance Computing (HPC) and Embedded Computing (EC) domains. Many recent HPC applications require huge amounts of information to be processed within a bounded amount of time while EC systems are increasingly concerned with providing higher performance in real-time. The convergence of these two domains towards systems requiring both high performance and a predictable time-behavior challenges the capabilities of current hardware architectures. Fortunately, the advent of next-generation many-core embedded platforms has the chance of intercepting this converging need for predictability and high-performance, allowing HPC and EC applications to be executed on efficient and powerful heterogeneous architectures integrating general-purpose processors with many-core computing fabrics. However, addressing this mixed set of requirements is not without its own challenges and it is now of paramount importance to develop new techniques to exploit the massively parallel computation capabilities of many-core platforms in a predictable way

    The Treatment of Acute Diaphyseal Long-bones Fractures with Orthobiologics and Pharmacological Interventions for Bone Healing Enhancement: A Systematic Review of Clinical Evidence

    The healing of long-bone diaphyseal fractures can often be impaired and eventually result in delayed union or non-union. A number of therapeutic strategies have been proposed in combination with surgical treatment in order to enhance the healing process, such as scaffolds, growth factors, cell therapies and systemic pharmacological treatments. Our aim was to investigate the current evidence on bone healing enhancement in acute long-bone diaphyseal fractures.

    On the Effectiveness of OpenMP teams for Programming Embedded Manycore Accelerators

    With the introduction of more powerful and massively parallel embedded processors, embedded systems are becoming HPC capable. In particular, heterogeneous systems-on-chip (SoCs) that couple a general-purpose host processor to a many-core accelerator are becoming more and more widespread, and provide tremendous peak performance/watt, well suited to executing HPC-class programs. The increased computation potential is, however, traded off for ease of programming. Application developers are required to manually outline the code parts suitable for acceleration, parallelize them efficiently over the many available cores, and orchestrate data transfers to and from the accelerator. In addition, since most many-cores are organized as a collection of clusters, featuring fast local communication but slow remote communication (i.e., to another cluster's local memory), the programmer should also take care to map the parallel computation properly so as to avoid poor data locality. OpenMP v4.0 introduces new constructs for computation offloading, as well as directives to deploy parallel computation in a cluster-aware manner. In this paper we assess the effectiveness of OpenMP v4.0 at exploiting the massive parallelism available in embedded heterogeneous SoCs, comparing it to standard parallel loops over several computation-intensive applications from the linear algebra and image processing domains.

    The Importance of Worst-Case Memory Contention Analysis for Heterogeneous SoCs

    Memory interference may heavily inflate task execution times in Heterogeneous Systems-on-Chips (HeSoCs). Knowing worst-case interference is consequently fundamental for supporting the correct execution of time-sensitive applications. In most of the literature, worst-case interference is assumed to be generated by, and is therefore estimated through, read-intensive synthetic workloads with no caching. Yet these workloads do not always generate worst-case interference, as the general results reported in this work show. By testing on multiple architectures, we determined that the traffic pattern generating the highest interference is actually hardware dependent, and that making assumptions could lead to a severe underestimation of the worst case (in our case, by more than 9x). Comment: Accepted for presentation at the CPS workshop 2023 (http://www.cpsschool.eu/cps-workshop